Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus and also distributed via email. The teaching materials will consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will let you spend your time coding, not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).
We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to this final lecture in a series of seven. We've previously covered data structures, data wrangling, exploratory data analysis, flow control, and most recently string search and manipulation, but today we will pull from all of those areas to build our own user-defined functions.
At the end of this lecture we will aim to have covered the following topics:
rpy2 package

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Nothing too big to use in lecture today. We'll just briefly revisit a file from Lecture 05.
This is a small example dataset that we'll use briefly to demonstrate context managers.
We're bringing back one of our Lecture 04 datasets so we can reproduce some plots at the end of lecture using some "magic".
IPython's InteractiveShell will be accessed just to set the behaviour we want so that we can see multiple code outputs per code cell.
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
pandas provides the DataFrame class that allows us to format and play with data in a tabular format.
re provides regular expression matching operations similar to those found in the programming language Perl.
# Install the rpy2 package
!pip install rpy2
# You may need to RESTART the kernel after running this!
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
# import pandas as pd
# import numpy as np
User-defined functions allow you to write your own code as functions that you and others can (re)use. You could achieve similar results by copying and pasting code on demand, but this is highly prone to errors: you may forget to replace some arguments or variables from a previous run, which may lead to serious issues that compromise the accuracy and reliability of your code and analyses.
Also, if at some point you realize that your code had an error, you would have to go back to every section and script that you wrote in order to update your code. Very tedious and risky, isn't it? Therefore, user-defined or customized functions can be very handy so you do not need to copy and paste code over and over. This is known as the DRY principle: Don't Repeat Yourself.
Here is a list of some of the advantages of writing your own functions:
Disclaimer:
User-defined functions can take time to write, depending on your Python skills and the complexity of the function that you're trying to construct. Thus, as with for loops, consider writing your own functions only when there are no publicly available, ready-to-use functions to do the task that you need to carry out. Be efficient and do not reinvent the wheel! Unless you have the luxury of time and just want to write code for fun.
Here is where we start, and there are plenty of good reasons to start with best practices: writing functions without following best practices is a recipe for madness. Among the best practices for writing your own functions, documenting your code should always be at the top. In the context of user-defined functions, documentation is known as a docstring, a text that appears at the top of a function when ? or the help() function is called. A well-written docstring is expected to contain (at least) the following information:
Docstrings are written within triple double-quotes """. Several docstring formats are used by the Python community, including Numpydoc, Google style, reStructuredText, and EpyText. As with variable-naming conventions, regardless of which format you decide to use, the key is to be consistent across docstrings. In other words, pick a format and stick to it. The difference between the formats lies (mainly) in the order in which the parts of a function are documented; otherwise their contents are very similar.
Numpydoc is the most common docstring format in Python. The documentation for the NumPy array is an example of a Numpydoc docstring.
import numpy as np
np.array?
Docstrings in the Numpydoc format can have sections for Description, Parameters, Returns, See Also, Notes, and Examples. If some of your function's parameters are optional, you state so in the description of that parameter. For example, if my_arg is boolean and optional, its entry would look like:
Parameters: my_arg: bool, optional
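As a sketch, here is what a complete Numpydoc-style docstring might look like for a hypothetical helper function (the function name and behaviour below are illustrative, not part of the course materials):

```python
def count_motif(sequence, motif="ATG", overlapping=False):
    """Count occurrences of a motif in a DNA sequence.

    Parameters
    ----------
    sequence : str
        The DNA sequence to search.
    motif : str, optional
        The motif to count (default is "ATG").
    overlapping : bool, optional
        If True, count overlapping matches (default is False).

    Returns
    -------
    int
        The number of times motif occurs in sequence.

    Examples
    --------
    >>> count_motif("ATGATG")
    2
    """
    if overlapping:
        # count every start position at which the motif matches
        return sum(sequence.startswith(motif, i) for i in range(len(sequence)))
    return sequence.count(motif)
```

Notice that the optional parameters are flagged as `optional` in the Parameters section, exactly as described above.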
To retrieve just a function's docstring, you can use the .__doc__ attribute to get the raw docstring:
# grab the array docstring
np.array.__doc__
"array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0)\n\n Create an array.\n\n Parameters\n ----------\n object : array_like\n An array, any object exposing the array interface, an object whose\n __array__ method returns an array, or any (nested) sequence.\n dtype : data-type, optional\n The desired data-type for the array. If not given, then the type will\n be determined as the minimum type required to hold the objects in the\n sequence.\n copy : bool, optional\n If true (default), then the object is copied. Otherwise, a copy will\n only be made if __array__ returns a copy, if obj is a nested sequence,\n or if a copy is needed to satisfy any of the other requirements\n (`dtype`, `order`, etc.).\n order : {'K', 'A', 'C', 'F'}, optional\n Specify the memory layout of the array. If object is not an array, the\n newly created array will be in C order (row major) unless 'F' is\n specified, in which case it will be in Fortran order (column major).\n If object is an array the following holds.\n\n ===== ========= ===================================================\n order no copy copy=True\n ===== ========= ===================================================\n 'K' unchanged F & C order preserved, otherwise most similar order\n 'A' unchanged F order if input is F and not C, otherwise C order\n 'C' C order C order\n 'F' F order F order\n ===== ========= ===================================================\n\n When ``copy=False`` and a copy is made for other reasons, the result is\n the same as if ``copy=True``, with some exceptions for `A`, see the\n Notes section. The default order is 'K'.\n subok : bool, optional\n If True, then sub-classes will be passed-through, otherwise\n the returned array will be forced to be a base-class array (default).\n ndmin : int, optional\n Specifies the minimum number of dimensions that the resulting\n array should have. 
Ones will be pre-pended to the shape as\n needed to meet this requirement.\n\n Returns\n -------\n out : ndarray\n An array object satisfying the specified requirements.\n\n See Also\n --------\n empty_like : Return an empty array with shape and type of input.\n ones_like : Return an array of ones with shape and type of input.\n zeros_like : Return an array of zeros with shape and type of input.\n full_like : Return a new array with shape of input filled with value.\n empty : Return a new uninitialized array.\n ones : Return a new array setting values to one.\n zeros : Return a new array setting values to zero.\n full : Return a new array of given shape filled with value.\n\n\n Notes\n -----\n When order is 'A' and `object` is an array in neither 'C' nor 'F' order,\n and a copy is forced by a change in dtype, then the order of the result is\n not necessarily 'C' as expected. This is likely a bug.\n\n Examples\n --------\n >>> np.array([1, 2, 3])\n array([1, 2, 3])\n\n Upcasting:\n\n >>> np.array([1, 2, 3.0])\n array([ 1., 2., 3.])\n\n More than one dimension:\n\n >>> np.array([[1, 2], [3, 4]])\n array([[1, 2],\n [3, 4]])\n\n Minimum dimensions 2:\n\n >>> np.array([1, 2, 3], ndmin=2)\n array([[1, 2, 3]])\n\n Type provided:\n\n >>> np.array([1, 2, 3], dtype=complex)\n array([ 1.+0.j, 2.+0.j, 3.+0.j])\n\n Data-type consisting of more than one element:\n\n >>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')])\n >>> x['a']\n array([1, 3])\n\n Creating an array from sub-classes:\n\n >>> np.array(np.mat('1 2; 3 4'))\n array([[1, 2],\n [3, 4]])\n\n >>> np.array(np.mat('1 2; 3 4'), subok=True)\n matrix([[1, 2],\n [3, 4]])"
Calling print() on the .__doc__ attribute hides the markup syntax (such as the \n newlines) and makes it more readable:
print(np.array.__doc__)
array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0)
Create an array.
Parameters
----------
object : array_like
An array, any object exposing the array interface, an object whose
__array__ method returns an array, or any (nested) sequence.
dtype : data-type, optional
The desired data-type for the array. If not given, then the type will
be determined as the minimum type required to hold the objects in the
sequence.
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements
(`dtype`, `order`, etc.).
order : {'K', 'A', 'C', 'F'}, optional
Specify the memory layout of the array. If object is not an array, the
newly created array will be in C order (row major) unless 'F' is
specified, in which case it will be in Fortran order (column major).
If object is an array the following holds.
===== ========= ===================================================
order no copy copy=True
===== ========= ===================================================
'K' unchanged F & C order preserved, otherwise most similar order
'A' unchanged F order if input is F and not C, otherwise C order
'C' C order C order
'F' F order F order
===== ========= ===================================================
When ``copy=False`` and a copy is made for other reasons, the result is
the same as if ``copy=True``, with some exceptions for `A`, see the
Notes section. The default order is 'K'.
subok : bool, optional
If True, then sub-classes will be passed-through, otherwise
the returned array will be forced to be a base-class array (default).
ndmin : int, optional
Specifies the minimum number of dimensions that the resulting
array should have. Ones will be pre-pended to the shape as
needed to meet this requirement.
Returns
-------
out : ndarray
An array object satisfying the specified requirements.
See Also
--------
empty_like : Return an empty array with shape and type of input.
ones_like : Return an array of ones with shape and type of input.
zeros_like : Return an array of zeros with shape and type of input.
full_like : Return a new array with shape of input filled with value.
empty : Return a new uninitialized array.
ones : Return a new array setting values to one.
zeros : Return a new array setting values to zero.
full : Return a new array of given shape filled with value.
Notes
-----
When order is 'A' and `object` is an array in neither 'C' nor 'F' order,
and a copy is forced by a change in dtype, then the order of the result is
not necessarily 'C' as expected. This is likely a bug.
Examples
--------
>>> np.array([1, 2, 3])
array([1, 2, 3])
Upcasting:
>>> np.array([1, 2, 3.0])
array([ 1., 2., 3.])
More than one dimension:
>>> np.array([[1, 2], [3, 4]])
array([[1, 2],
[3, 4]])
Minimum dimensions 2:
>>> np.array([1, 2, 3], ndmin=2)
array([[1, 2, 3]])
Type provided:
>>> np.array([1, 2, 3], dtype=complex)
array([ 1.+0.j, 2.+0.j, 3.+0.j])
Data-type consisting of more than one element:
>>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')])
>>> x['a']
array([1, 3])
Creating an array from sub-classes:
>>> np.array(np.mat('1 2; 3 4'))
array([[1, 2],
[3, 4]])
>>> np.array(np.mat('1 2; 3 4'), subok=True)
matrix([[1, 2],
[3, 4]])
Another option is to use the built-in inspect module, which also has many other useful functions for retrieving docstrings. Read more about this module at its documentation page.
import inspect
print(inspect.getdoc(np.array))
array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0)
Create an array.
Parameters
----------
object : array_like
An array, any object exposing the array interface, an object whose
__array__ method returns an array, or any (nested) sequence.
dtype : data-type, optional
The desired data-type for the array. If not given, then the type will
be determined as the minimum type required to hold the objects in the
sequence.
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements
(`dtype`, `order`, etc.).
order : {'K', 'A', 'C', 'F'}, optional
Specify the memory layout of the array. If object is not an array, the
newly created array will be in C order (row major) unless 'F' is
specified, in which case it will be in Fortran order (column major).
If object is an array the following holds.
===== ========= ===================================================
order no copy copy=True
===== ========= ===================================================
'K' unchanged F & C order preserved, otherwise most similar order
'A' unchanged F order if input is F and not C, otherwise C order
'C' C order C order
'F' F order F order
===== ========= ===================================================
When ``copy=False`` and a copy is made for other reasons, the result is
the same as if ``copy=True``, with some exceptions for `A`, see the
Notes section. The default order is 'K'.
subok : bool, optional
If True, then sub-classes will be passed-through, otherwise
the returned array will be forced to be a base-class array (default).
ndmin : int, optional
Specifies the minimum number of dimensions that the resulting
array should have. Ones will be pre-pended to the shape as
needed to meet this requirement.
Returns
-------
out : ndarray
An array object satisfying the specified requirements.
See Also
--------
empty_like : Return an empty array with shape and type of input.
ones_like : Return an array of ones with shape and type of input.
zeros_like : Return an array of zeros with shape and type of input.
full_like : Return a new array with shape of input filled with value.
empty : Return a new uninitialized array.
ones : Return a new array setting values to one.
zeros : Return a new array setting values to zero.
full : Return a new array of given shape filled with value.
Notes
-----
When order is 'A' and `object` is an array in neither 'C' nor 'F' order,
and a copy is forced by a change in dtype, then the order of the result is
not necessarily 'C' as expected. This is likely a bug.
Examples
--------
>>> np.array([1, 2, 3])
array([1, 2, 3])
Upcasting:
>>> np.array([1, 2, 3.0])
array([ 1., 2., 3.])
More than one dimension:
>>> np.array([[1, 2], [3, 4]])
array([[1, 2],
[3, 4]])
Minimum dimensions 2:
>>> np.array([1, 2, 3], ndmin=2)
array([[1, 2, 3]])
Type provided:
>>> np.array([1, 2, 3], dtype=complex)
array([ 1.+0.j, 2.+0.j, 3.+0.j])
Data-type consisting of more than one element:
>>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')])
>>> x['a']
array([1, 3])
Creating an array from sub-classes:
>>> np.array(np.mat('1 2; 3 4'))
array([[1, 2],
[3, 4]])
>>> np.array(np.mat('1 2; 3 4'), subok=True)
matrix([[1, 2],
[3, 4]])
Another best practice when writing functions is the "Do One Thing" principle: each function should do one thing, one task. Instead of one big function, you can write several small ones, one per task, without going to the other extreme, which would be fragmenting your code into a ridiculous number of snippets. By doing one thing, your functions become easier to read, test, debug, and reuse.
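As a hedged sketch of the "Do One Thing" principle (the functions and data below are illustrative, not from the course datasets), a single "does everything" analysis step can be split into two small, single-task functions:

```python
def clean_values(raw):
    """Convert raw strings to floats, dropping entries that fail to parse."""
    cleaned = []
    for item in raw:
        try:
            cleaned.append(float(item))
        except ValueError:
            pass  # skip anything that is not a number
    return cleaned

def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Each function does one task, so each can be tested and reused on its own
raw_data = ["1.5", "2.5", "oops", "4.0"]
mean(clean_values(raw_data))
```

Because cleaning and averaging are separate, you could later reuse clean_values() with a median function, or test each piece independently.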
Time to start writing our own functions.
To define a function, we need three key pieces plus the docstring:
Function header: The function name is also known as the identifier of the function. Since a function definition is an executable statement, its execution binds the function name to the function object, which can be called later on using the identifier followed by parentheses ().
Parameters: An optional list of identifiers that get bound to the values supplied as arguments when the function is called. A function may have an arbitrary number of parameters, separated by commas.
Statement(s): Also known as the function body, this is a non-empty sequence of statements executed each time the function is called. A function body cannot be empty, just like any indented block.
Docstring: Documentation about the function
Therefore, the structure of a function is:
def function_name(parameters):
    """Docstring information
    """
    statement(s)
Let's make our first function print_square() but we'll keep it simple to start before building up to more complex functions. Without any parameters, a function will usually run without any external information and therefore will return the same output in most cases. Of course that assumes you're not making a random number generator - which we aren't!
def print_square(): # function header
    """Print the square of 4
    Parameters: None
    """
    new_value = 4 ** 2 # function body. 4 to the power of 2, or 4 squared
    print(new_value)
# Show us the docstring of the function
help(print_square)
Help on function print_square in module __main__:

print_square()
    Print the square of 4
    Parameters: None
Because this function has no parameters, the output will always be 16
# Call on our new function
print_square()
16
A hard-coded function such as the one we defined above is useful on very few occasions. Adding parameters makes it more flexible, since it can alter its output based on user-provided input.
Let's redefine print_square() but this time include a parameter value that we will use as the base to our squaring function.
def print_square(value):
    """Print the square of an integer or float
    Parameters
    ----------
    value: integer or float, the value to be squared
    Returns
    -------
    out : prints an integer or float, the value ** 2
    """
    new_value = value ** 2 # value to the power of 2, or value squared
    print(new_value)
This time, running print_square() without an argument will throw a TypeError
# What happens if we just run print_square()?
print_square()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-60dea6c9dfac> in <module>
      1 # What happens if we just run print_square()?
----> 2 print_square()

TypeError: print_square() missing 1 required positional argument: 'value'
# Let's use our function with an integer input
print_square(4)
16
The function now works with any integer or float...
print_square(8.5)
72.25
but not strings!
print_square('string')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-f3ab3fceaeea> in <module>
----> 1 print_square('string')

<ipython-input-10-39d90d1322ed> in print_square(value)
      9     out : prints an integer or float, the value ** 2
     10     """
---> 11     new_value = value ** 2 # value to the power of 2, or value squared
     12     print(new_value)

TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
When we tried to run print_square() without providing a value, it generated a TypeError since we were not assigning an appropriate object to the parameter value. You may have noticed throughout this course that a number of the functions and methods we use have default values for some of their parameters. This essentially makes the parameter optional for the user, and provides a default behaviour to the function itself.
You can do the same by assigning default values to your user-defined functions too. Let's update print_square() to have a default value.
# Update our function to have a default value
def print_square(value = 2):
    """Print the square of an integer or float
    Parameters
    ----------
    value: integer or float, the value to be squared
    Returns
    -------
    out : prints an integer or float, the value ** 2
    """
    new_value = value ** 2 # value to the power of 2, or value squared
    print(new_value)
# Now we can call it without parameters
print_square()
# Or with
print_square(3)
4
9
Use the return statement to pass an object reference back from your function

Instead of printing, we can return the final value of our function. The return statement both ends a function and passes back a reference to an object (integer, list, or other item) that has meaning for both humans and computers. In contrast, when we call print(), it only prints a string for the human eye; what is being printed means nothing to the program.
Let's make a new function, return_square() by replacing our print() call with a return statement.
# Update print_square() to return a value instead of print()ing a result
def return_square(value):
    """Return the square of an integer or float
    Parameters
    ----------
    value: integer or float, the value to be squared
    Returns
    -------
    out : integer or float, the value ** 2
    """
    new_value = value ** 2 # value to the power of 2, or value-squared
    return new_value # return our value
# Run our new return_square() function
return_square(8)
64
# Can we access new_value?
new_value
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-20-be0934324580> in <module>
      1 # Can we access new_value?
----> 2 new_value

NameError: name 'new_value' is not defined
We attempted to access new_value, which only exists within return_square(). It does not exist in memory after the function returns the reference; nothing is holding onto the object, and thus it disappears from memory. In order to retain the output from return_square(), we need to assign it to a variable.
# Assign our function's result to a variable
my_new_val = return_square(8)
print(my_new_val)
64
Recall our function print_square() which did not contain an explicit return statement. Instead it ended on a print() statement which still showed us the value of our squaring function. In the case of a function that completes without a return statement, Python will default to returning the None object.
That's right, a function will always return an object - either by explicit call or implicitly by design. Let's check back on our print_square() function.
# Let's check on the object type returned by print_square()
type(print_square(4))
16
NoneType
Here's another example where we literally do nothing with our function!
def do_nothing():
pass # tells Python to "keep going" but there is nothing to be done
print(do_nothing())
None
Remember that the function body is a non-empty sequence of statements. In the example above, the pass statement is a null operation: when you execute it, nothing happens.
Functions are self-contained portions of code that can look, for the most part, like what we've been learning over the last five lectures. That means you can add loops, flow control statements, and even calls to other packages, modules, and user-defined functions within your own function.
Let's use some if and else statements in a new function.
# Define a new function
def many_types(x):
    """Check if a value is less than zero or not
    Parameters
    ----------
    x: integer or float, the value we are checking
    Returns
    -------
    out: string, a statement about whether or not the value is less than 0
    """
    if x < 0:
        return "Yes, x is less than 0!"
    else:
        return 'nope'
# Check our new function with 1
print(many_types(1))
# and -1
print(many_types(-1))
nope
Yes, x is less than 0!
We've already worked with many functions that take in multiple parameters. Defining one is not much harder: simply include additional parameters separated by commas. Also, try to give your parameters some meaningful names while you're at it.
# Define a new function that extends return_square()
def raise_to_power(base_value = 1, exponent_value = 1):
    # generate a value to return
    result = base_value ** exponent_value
    # Can we return just the calculation as a result?
    # return result
    return (base_value ** exponent_value)
# Use raise_to_power and name the parameters
raise_to_power(exponent_value = 2, base_value = 3)
raise_to_power(2, 3)
9
8
Did you notice that we swapped the positions of the two arguments when we called raise_to_power()? It is possible to change the order of the arguments without affecting the results if you use the names of the parameters, in this case base_value and exponent_value. If you swap the argument positions without naming the parameters, Python will assign the arguments in the order defined by your function.
# Raise 2 to the power of 3
raise_to_power(2, 3)
8
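You can also mix the two styles, as long as every positional argument comes before any keyword argument. A quick sketch (repeating the raise_to_power() definition so the cell stands alone):

```python
# Repeat the raise_to_power() definition so this cell stands alone
def raise_to_power(base_value = 1, exponent_value = 1):
    return base_value ** exponent_value

# Positional arguments must come before keyword arguments
raise_to_power(2, exponent_value = 3)   # base_value = 2, so this is 2 ** 3 = 8
# raise_to_power(base_value = 2, 3)    # SyntaxError: positional argument follows keyword argument
```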
Here is a function that returns the dot product of two matrices.
# Define our dot-product function
def product(A, B): # two parameters
    """Performs matrix multiplication"""
    result = np.dot(A, B)
    return result

import numpy as np
matrix_1 = np.array([[1, 2],
                     [3, 4]])
matrix_2 = np.array([[5, 6],
                     [7, 8]])
product(A = matrix_2, B = matrix_1) # two arguments
array([[23, 34],
       [31, 46]])
Use * to create an arbitrary number of positional arguments

You can write functions that take any number of arguments by adding *args as the only parameter. The syntax is the asterisk (*); args is just a word used by convention. Recall that we saw this with variables before, with the unpacking/packing operator back in Lecture 04? It's back to help pack multiple arguments into a single parameter.
Be aware that *args cannot be assigned default arguments of any kind; just use it as-is. The example below runs a for loop that print()s every argument passed to the function.
# Define a function that contains a for loop that prints the unpacked arguments
# args will be a tuple containing all values that are passed in
def func(*args):
    for i in args:
        print(i)
# Calling it with 3 arguments
func(1, 2, 3)
1
2
3
Use * to pass elements from an iterable

As you've probably guessed by now, you can also pass values to func() as a list or other iterable. The key to treating each element separately is to use the * unpacking operator here as well. It will break out each element of the list into separate objects packed into args, which itself is a tuple object.
# Make a list of arguments
list_of_arg_values = [1, 2, 3]
# Run func() using the list as-is
print("Run func() without unpacking values")
func(list_of_arg_values)
print("\nRun func() with unpacked values")
# Run func() with the unpacked list
func(*list_of_arg_values)
Run func() without unpacking values
[1, 2, 3]

Run func() with unpacked values
1
2
3
You can also pass a predefined array preceded by an asterisk.
import numpy as np
array_1 = np.array([1, 2, 3, 4])
func(*array_1) # prints each element separately, as you would expect when the function was written
1
2
3
4
Use ** to assign an arbitrary number of keyword arguments

In the example above we were using positional arguments because we were not naming any of the values explicitly in the func() call. In section 1.7.0 we assigned parameter values by their keywords.
In case you need a function with an arbitrary number of keywords, **kwargs will take care of that for you. **kwargs allows you to handle named arguments that you have not defined in advance, which means you can send an arbitrary (optional) group of named arguments to your function. This also shortens your function header, especially if you plan to have a lot of parameters. Much like *args, this provides flexibility in cases where the amount of information you need can vary by situation.
Again, the key to making this work is the ** in your function definition. kwargs is simply the standard naming convention, not a strictly required variable name.
Here is a function that prints a dictionary's key:value pairs.
# Define our dictionary
dictionary_aminoacids = {"Alanine": {"Ala", "A", "GCA GCC GCG GCT"},
"Cysteine": {"Cys", "C", "TGC, TGT"},
"Aspartic acid": {"Asp", "D", "GAC GAT"},
"Glutamic acid": {"Glu", "E", "GAA GAG"},
"Phenylalanine": {"Phe", "F", "TTC TTT"},
"Glycine": {"Gly", "G", "GGA GGC GGG GGT"},
"Histidine": {"His", "H", "CAC CAT"},
"Isoleucine": {"Ile", "I", "ATA ATC ATT"},
"Lysine": {"Lys", "K", "AAA AAG"},
"Leucine": {"Leu", "L", "TTA TTG CTA CTC CTG CTT"},
"Methionine": {"Met", "M" "ATG"},
"Asparagine": {"Asn", "N", "AAC AAT"},
"Proline": {"Pro", "P", "CCA CCC CCG CCT"},
"Glutamine": {"Gln", "Q", "CAA CAG"},
"Arginine": {"Arg", "R", "AGA AGG CGA CGC CGG CGT"},
"Serine": {"Ser", "S", "AGC AGT TCA TCC TCG TCT"},
"Threonine": {"Thr", "T", "ACA ACC ACG ACU"},
"Valine": {"Val", "V", "GTA GTC GTG GTT"},
"Tryptophan": {"Trp", "W", "TGG"},
"Tyrosine": {"Tyr", "Y," "TAC TAT"}
}
# Make a function with the **kwargs parameter
def func(**kwargs): # kwargs will be a dictionary containing the names as keys and the values as values
# Generate an iterator from the kwargs dictionary
for key, value in kwargs.items():
# Print the values with the .format function substitute
print('{0} = {1}'.format(key, value))
# Call our function and pass along our amino acid dictionary
func(**dictionary_aminoacids)
Alanine = {'Ala', 'A', 'GCA GCC GCG GCT'}
Cysteine = {'C', 'Cys', 'TGC, TGT'}
Aspartic acid = {'D', 'Asp', 'GAC GAT'}
Glutamic acid = {'GAA GAG', 'Glu', 'E'}
Phenylalanine = {'TTC TTT', 'Phe', 'F'}
Glycine = {'Gly', 'G', 'GGA GGC GGG GGT'}
Histidine = {'His', 'CAC CAT', 'H'}
Isoleucine = {'I', 'ATA ATC ATT', 'Ile'}
Lysine = {'AAA AAG', 'Lys', 'K'}
Leucine = {'TTA TTG CTA CTC CTG CTT', 'L', 'Leu'}
Methionine = {'MATG', 'Met'}
Asparagine = {'AAC AAT', 'Asn', 'N'}
Proline = {'P', 'CCA CCC CCG CCT', 'Pro'}
Glutamine = {'Gln', 'Q', 'CAA CAG'}
Arginine = {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}
Serine = {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}
Threonine = {'T', 'Thr', 'ACA ACC ACG ACT'}
Valine = {'GTA GTC GTG GTT', 'Val', 'V'}
Tryptophan = {'W', 'Trp', 'TGG'}
Tyrosine = {'Tyr', 'Y', 'TAC TAT'}
**kwargs, you are a dictionary¶In our above examples we created a dictionary object dictionary_aminoacids and passed that to the function with the ** operator. Is this the reason why we can treat kwargs as a dictionary object? No.
Regardless of how we pass arguments to it through our function, the **kwargs parameter is a dictionary object. That means when we generate an iterator from it, it follows the same rules as any dictionary, including its keys(), values(), and items() methods.
# Let's use a few different keywords and see how func works
func(first = 1, second = 2, third = "three", that = 1.2)
first = 1
second = 2
third = three
that = 1.2
# Make a function with the **kwargs parameter
def func(**kwargs): # kwargs will be a dictionary containing the names as keys and the values as values
    # Generate a value iterator from the kwargs dictionary
    # Are we doing this correctly?
    for value in kwargs:
        # Print the values from kwargs
        print(value) # Oh no, these are the keys, not the values!
# Test our new function
func(junk = 4, **dictionary_aminoacids)
junk
Alanine
Cysteine
Aspartic acid
Glutamic acid
Phenylalanine
Glycine
Histidine
Isoleucine
Lysine
Leucine
Methionine
Asparagine
Proline
Glutamine
Arginine
Serine
Threonine
Valine
Tryptophan
Tyrosine
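If we actually wanted the values rather than the keys, iterating over kwargs.values() does the trick. A minimal sketch (the function name print_values is our own, not part of the examples above):

```python
# Make a function that iterates over the values explicitly
def print_values(**kwargs):
    # .values() yields each value; bare iteration over kwargs yields keys
    for value in kwargs.values():
        print(value)

# Test our function with a couple of keyword arguments
print_values(first=1, second="two")
```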
In Lecture 05 we first encountered the lambda keyword, which creates an in-line function containing a single expression. Recall that an expression evaluates to a value, and that a lambda function cannot contain statements. The value of this expression is what the function returns when invoked.
Consider the following function, greeting() which we will define formally.
# Define a function called greeting
def greeting():
    return "Hello"
# Call on our greeting
greeting()
'Hello'
This particular function can also be written as a lambda function as follows:
greet_me = lambda: "Hello"
greet_me()
'Hello'
In the above example, and in Lecture 06, we saw examples where we assigned our lambda function to a variable. This was done to help set up our examples for clarity, especially when the lambda function was long.
The purpose of the lambda function, however, is to remain anonymous. This matters for a number of reasons, including transparency for those reading your code. Notice that a lambda function has no docstring, which means its purpose cannot simply be queried by someone reading through your code. If you have assigned your lambda elsewhere, a reader will not easily know what it does, whereas inserting the lambda function as it is meant to be used, inline with your code, makes its purpose far easier to ascertain.
Question: What if I keep using the same or similar lambda code in different parts of my code?
Answer: Refactor your work and make a function that you can call on specifically. It will reduce potential errors and simplify your code in the long run. Do not assign your lambda to a variable.
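For example, if a lambda such as lambda s: s.strip().lower() kept reappearing in your code, a small named function is the better home for it. A sketch (the name normalize is our own choice):

```python
def normalize(s):
    """Strip surrounding whitespace and lowercase a string (refactored from a repeated lambda)."""
    return s.strip().lower()

print(normalize("  Hello "))  # hello
print(normalize("WORLD"))     # world
```

Unlike the lambda it replaces, the named function carries a docstring and a single point of definition to fix or extend later.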
Here is a simple anonymous function that shows clearly how lambda works.
# Note our use of the map function?
list(map(lambda x: x * 2, [1, 2, 3, 4, 5]))
[2, 4, 6, 8, 10]
map() to link objects to functions¶In our code above, x takes the value of each element in the list and is multiplied by 2. The list() and map() calls are there just to let us see the output of the in-line function. It somewhat resembles a for loop, don't you think?
The map() function takes on the form:
map(function, iterable[, iterable1, iterable2, ..., iterableN])
and applies the function object to each element of the iterable(s). The function could be a pre-defined function, method, or other Python "callable" (passed without the () parentheses), or an anonymous lambda function as we saw above. The map() function returns an iterator of the transformed elements. This takes the place of an explicit for loop, but you still have to deal with the iterator after it is returned.
Here's an example of map() on its own, reusing the return_square() function from earlier.
# Map our square function to an integer list
squared = map(return_square, [2,4,6,8])
# What is returned
type(squared)
# Unpack the iterator
print(*squared, sep=", ")
map
4, 16, 36, 64
# OR Assign it to a list
squared = map(return_square, [2,4,6,8])
list(squared)
[4, 16, 36, 64]
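The extra iterable slots in the map() signature deserve a quick demonstration: with more than one iterable, the function receives one element from each per step, and map() stops at the shortest iterable. A small sketch:

```python
# Pairwise addition across two lists; the unmatched 40 is simply ignored
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30, 40]))
print(sums)  # [11, 22, 33]
```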
filter() to link an iterable to a filtering function¶Much like map(), the filter() function (not to be confused with pandas.DataFrame.filter()) can be used to quickly filter through a single iterable. While some objects that we've worked with can be filtered on an element-by-element basis (Numpy arrays and Pandas DataFrames), core Python iterables haven't been as easily filtered... until now!
The filter() function takes on the form:
filter(function, iterable)
and applies function, which is a decision function typically returning a boolean result. While there are other use-cases for filter(), we won't get into those today. Unlike map(), the filter() function can only evaluate a single iterable, and it returns an iterator yielding only those elements for which the function's result is true. You can then treat this iterator like any other to evaluate the elements individually or as a whole.
And yes, our function can be a lambda function too! The following code combines filter() with a lambda function to output each element of the list that starts with the letter "d".
# Let's list our favourite animals
list_1 = ['wolf', 'sheep', 'duck', 'dolphin']
# filter returns an iterator yielding those items of iterable for which function(item) is true.
list(filter(lambda x: x.startswith('d'), list_1))
['duck', 'dolphin']
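Because both functions return iterators, filter() and map() compose naturally. Here is a sketch that keeps the "d" animals and then uppercases the survivors:

```python
list_1 = ['wolf', 'sheep', 'duck', 'dolphin']
# Filter first, then transform what remains with map
result = list(map(str.upper, filter(lambda x: x.startswith('d'), list_1)))
print(result)  # ['DUCK', 'DOLPHIN']
```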
# Make a function with the *args and **kwargs parameters
def comp_func(*args, **kwargs):
    for key in kwargs:
        if key == "cum_sum" and kwargs[key] == True:
            print("The cumulative sum of *args is:", sum(args))
        elif key == "product" and kwargs[key] == True:
            product = 1
            for i in args:
                product = product * i
            print("The combined product of *args is:", product)
        else:
            print("Function", key, "was not activated")
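The function above is defined but never called, so here is what a test call might look like (the definition is repeated so this sketch runs on its own):

```python
# Same definition as above, repeated so the sketch is self-contained
def comp_func(*args, **kwargs):
    for key in kwargs:
        if key == "cum_sum" and kwargs[key] == True:
            print("The cumulative sum of *args is:", sum(args))
        elif key == "product" and kwargs[key] == True:
            product = 1
            for i in args:
                product = product * i
            print("The combined product of *args is:", product)
        else:
            print("Function", key, "was not activated")

# Positional arguments land in args; keyword arguments land in kwargs
comp_func(1, 2, 3, cum_sum=True, product=True)
```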
Up until now, we've been moving through our code without a care in the world. Recall that deep in the background, every time we declare a variable, it is assigned a place in memory. Whenever we assign an object to a variable, it acts as a tether to wherever that object exists in memory. We've called this lifeline the reference to the object. When we re-assign a variable to a new object, that original link is essentially destroyed.
In Python, and most other modern programming languages, whenever all references to an object have been destroyed, the object itself is (eventually) removed from memory. As another example, we saw instances of scope usage when declaring variables just outside a for loop. While things run slightly differently in the Jupyter Notebook, the same basic rules apply.
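You can peek at this bookkeeping with sys.getrefcount(), which reports how many references an object currently has. A sketch for illustration (note the reported count includes a temporary reference created by the call itself):

```python
import sys

data = [1, 2, 3]
alias = data  # a second tether to the same list object
# At least three references exist here: data, alias, and the call's temporary
print(sys.getrefcount(data) >= 3)  # True
del alias     # destroy one reference; the list survives through data
print(sys.getrefcount(data) >= 2)  # True
```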
Scope can refer to multiple things, but in Python it is primarily focused on the visibility of variables (which names can be accessed where) and their lifetime (how long they exist in memory). Both are somewhat intertwined within the use of scope, which changes when we use functions.
In a very simplified way, functions can be thought of as separate rooms in a house or sandboxes in a playground. When we call on these functions, we are entering a clean version of these rooms which are built with some knowledge of the basic house architecture.
Thus a variable is usually either global or local in scope. If it is local, the information about it simply disappears at the end of the function. The scope of a variable can usually be read off from indentation: it extends through the indented block in which it was defined. After you've exited that block, anything explicitly declared within it (i.e., new variables from that section) will be released from memory unless it has been passed on in a return statement.
Of course, Python also includes an encapsulating or non-local scope that allows named variables in nested functions to be seen by their encapsulating scope (like a room within a room). There are also other keywords that allow a program to see the global scope from a local context. We can't * (unpack) all of that here.
Why is scope important?
Understanding the basics of this concept will save you a lot of troubles down the road as you make more and more complex programs. You'll learn to avoid declaring variables in the wrong place, or trying to access ones that no longer exist in your scope. Let's revisit our examples from above.
# A quick example about scope and namespaces
global_value = 100
def scoping_func():
    local_value = 42
    print("We can access global: {} and local: {} values".format(global_value, local_value))
# call on our function
scoping_func()
We can access global: 100 and local: 42 values
We've already seen this happening in our previous examples of print_square() and return_square(). In both cases we used the same variable name value without any issue. That's because each of those functions has their own namespace and neither can interact with each other, in the way they've been coded.
Let's try a more complex example where we are reusing a global variable inside a local context. Let's alter the value of global_value in our example from within the function by assigning it as an integer.
# You can reuse variable names when changing scope.
global_value = 100
def scoping_func():
    local_value = 42
    global_value = 0 # We're actually defining a new variable locally
    print("We can alter global: {} and local: {} values".format(global_value, local_value))
# call on our function
scoping_func()
print("But outside the function global_value is still {}".format(global_value))
We can alter global: 0 and local: 42 values
But outside the function global_value is still 100
Yes, you can reuse a globally declared variable's name for a locally declared variable, but doing so does not change the value of the global variable. This comes down to namespaces again, and to the order of operations when Python resolves a variable assignment. Let's do one last example and see what happens when we try to update the value of global_value.
# Try to update global_value from inside your function
global_value = 100
def scoping_func():
    local_value = 42
    global_value = global_value + 1 # Let's increment global_value by 1
    print("We can alter global: {} and local: {} values".format(global_value, local_value))
# call on our function
scoping_func()
print("But outside the function global_value is still {}".format(global_value))
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-52-16b3f24ecaa0> in <module>
     10
     11 # call on our function
---> 12 scoping_func()
     13 print("But outside the function global_value is still {}".format(global_value))

<ipython-input-52-16b3f24ecaa0> in scoping_func()
      5
      6     local_value = 42
----> 7     global_value = global_value + 1 # Let's increment global_value by 1
      8
      9     print("We can alter global: {} and local: {} values".format(global_value, local_value))

UnboundLocalError: local variable 'global_value' referenced before assignment
Unfortunately we cannot escape this error; it's a Python design choice. When Python compiles the function, it builds the namespace and sees the assignment global_value = ... within the function. This means it will treat global_value as a local variable throughout the function, and even if we broke up the assignment or moved it around, that fact would not change.
If you really need to alter global_value, pass it in as an argument and return a value that you can assign back to global_value. This is the cleanest way to alter a global variable with a function.
# Pass your variable and re-assign it as a function's output
global_value = 100
def scoping_func(global_value):
    local_value = 42
    return_value = global_value + local_value # Add local_value to global_value
    print("We can see global: {} and local: {} values".format(global_value, local_value))
    return return_value
# call on our function
global_value = scoping_func(global_value)
print("And re-assign global_value to {}".format(global_value))
We can see global: 100 and local: 42 values
And re-assign global_value to 142
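For completeness, Python does offer a global keyword (one of the keywords alluded to earlier) that lets a function rebind a module-level name directly, though passing arguments and returning values is usually cleaner. A sketch:

```python
global_value = 100

def increment_global():
    # Declare that assignments to global_value target the module-level name
    global global_value
    global_value = global_value + 1

increment_global()
print(global_value)  # 101
```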
Now that you're ready, it's time to put it all together with some examples.
Here is an example of a user-defined function using data from lecture 6 (Regular Expressions). The starting point was a DNA sequence.
Here we'll also use the \ as a form of line-continuation. Unlike the ''' triple-quote, this will not insert additional newlines into the string.
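A quick contrast between the two continuation styles, as a sketch (short stand-in strings, not the real sequence):

```python
# Backslash continuation: adjacent string literals concatenate with no newline
joined = 'GCGT'\
         'TGCT'
print(joined)  # GCGTTGCT

# A triple-quoted string keeps the line break as part of the result
multiline = '''GCGT
TGCT'''
print(multiline == 'GCGT\nTGCT')  # True
```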
DNA = 'GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA'\
'GCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGC'\
'TCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGTGCCGGCAGCGCTCTGGGTCAT'\
'TTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACTCCAAACGTTTCGGCGAGAAGCAGG'\
'CCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCC'\
'ATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATG'\
'GACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTG'\
'GAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTT'\
'CGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCC'\
'CTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGG'\
'AGAACTGTGAATGCGCAAACCAACCCTTGGCCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
# Output our DNA sequence
DNA
'GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
To translate that DNA into mRNA, we wrote a lambda function:
import re
# Translate our DNA to mRNA
mRNA = re.sub('.', # pattern
              lambda m: {'G':'C', 'C':'G', 'A':'U', 'T':'A'}.get(m.group(), "X"), # repl
              DNA # string
              )
# this does what we wanted
# Show the output of our lambda function
mRNA
'CGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUUUAGCUGCGCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGAGGGAGCACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCUUCGCACCGACGAGUGCGACAUGGAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACCCGACACACGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGGGCCAUUUCAUCCUGUCCACGGCCGUCGCGAGACCCAGUAAAAGCCGCUCCUGGCGAAAGCGACCUCUAGCCGGACAGCGAACGCCAUAAGCCUUAGAACGUGCGGGAGCGAGUUCGGAAGCAGUGAGGUUUGCAAAGCCGCUCUUCGUCCGGUAAUAGCGGCCGUACCGCCGGCUGCGCGACCCGACCGCAAGCGCUGCGCUCCGACCUACCGGAAGGGGUAAUACUAAGAAGAGCGAAGGCCGCCGGGCGCAACGUCCGGUACGACAGGUCCGUCCAUCUACUGCUGGUAGUCCCUGUCGAAGUUGCCGAGAAUGGUCGGAUUGAAGCUAGUGACCUGGCGACUAGCAGUGCCGCUAAAUACGGCGUGUACCUGCGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUGUUCAGUCUCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGCGAGAGGACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCGAAAGAGUUACGAGUGCGACAUCCAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACUGCUUGGGGGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGUGCUGAAUUGCCCAACCGUACCUAACAUCCGCGGCGGGAUAUGGAACAGACGGAGGGGCGCCACGUACCUCGGCCCGGUGGAGCUGGACUUACCUUCGGCCGCCGUGGAGCGAUUGCCGGUUCUUAACCUCGGUUAGUUAAGAACGCCUCUUGACACUUACGCGUUUGGUUGGGAACCGGUAGCGCAGGCGGUAGAGGUCGUCGGCGUGCGCCGCGUAGAGCCCGUCGCAACCCAGGA'
Let's break down the lambda function above:
re.sub( ) takes the following arguments (in the order they appear): pattern, repl, string. Here are each one of the arguments we are passing:
pattern = '.', which, as you know, stands for "any character"; this is the regular expression we want to use to identify our pattern.
repl = the replacement, which is what we are writing as a lambda function. It will find the keys and replace them with their corresponding values.
string = DNA, the string that we want to search and modify.
The entire lambda function is contained within re.sub(). Notice that we did not save the function; we simply used it "on the fly" to perform the translation of DNA into mRNA.
m is a placeholder that can take any form (Python object). In this case, m is a re.Match object, which we know has the group() method to get the string sequence that matched our pattern.
The result of the group() method becomes an argument for the dictionary's .get() method, which returns the value stored under the key m.group().
After translation into mRNA, we further translated the ribonucleotide sequence into amino acids based on their codons. For that purpose, we created a dictionary with codons as keys and encoded amino acids as values.
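The same machinery in miniature: a lambda that receives each re.Match object and transforms the matched text (a toy example, unrelated to the DNA data):

```python
import re

# Double every digit in the string; m.group() returns the matched character
result = re.sub(r'\d', lambda m: str(int(m.group()) * 2), 'a1b2c3')
print(result)  # a2b4c6
```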
# mRNA to Protein
translation_aminoacids = {
'UUU':'F', 'UUC':'F', # Phenylalanine
'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L', # Leucine
'AUU':'I', 'AUC':'I', 'AUA':'I', # Isoleucine
'AUG':'M', # Methionine
'GUU':'V' , 'GUC':'V', 'GUA':'V', 'GUG':'V', # Valine
'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S', # Serine
'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', # Proline
'ACU':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T', # Threonine
'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A', # Alanine
'UAU':'Y', 'UAC':'Y', # Tyrosine
'UAA':'', 'UAG':'', 'UGA':'', # Stop. We will translate stop codons into nothing for this exercise
'AUG':'START', # Start (note: this overrides the earlier 'AUG':'M' entry)
'CAU':'H', 'CAC':'H', # Histidine
'CAA':'Q', 'CAG':'Q', # Glutamine
'AAU':'N', 'AAC':'N', # Asparagine
'AAA':'K', 'AAG':'K', # Lysine
'GAU':'D', 'GAC':'D', # Aspartic acid
'GAA':'E', 'GAG':'E', # Glutamic acid
'UGU':'C', 'UGC':'C', # Cysteine
'UGG':'W', # Tryptophan
'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R', # Arginine
'AGU':'S', 'AGC':'S', # Serine
'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G' # Glycine
}
Last week we used a few different methods to convert our mRNA sequence to amino acids. In both instances we used list comprehension to achieve our goal. Now let's put together a user-defined function that we can call on from anywhere in the program.
We'll call it translate() and it will have two parameters:
dict, the amino acid dictionary we want to translate with.
seq, the mRNA sequence we wish to translate.
Within our function we'll also use two re functions we didn't cover in class last week:
re.compile(): Used to create a regular expression object (or pattern) that can be reused several times throughout a program. Through it, you can access many of the same re functions we've previously discussed; the big difference is that you don't need to supply a pattern argument, since the object substitutes itself in directly.
re.escape(): Used to insert escape characters wherever a pattern contains metacharacters. This treats the string you provide as literal text and adds escape characters so it will be interpreted properly within a regex pattern.
# Define our function translate()
def translate(dict, seq):
    # Build a regular expression object
    regex = re.compile('|'.join(map(re.escape, dict.keys())))
    # return a call to sub()
    return regex.sub(lambda x: dict[x.string[x.start():x.end()]], # Our replacement string
                     seq)
# Run our function
print(translate(translation_aminoacids, mRNA))
RNDRKKVSEAGGLLVVFLAAPPLWAVLIFLWSAKGDLRGSURLGRRSTARTGPSTARTDRRKEGSPSHRRVRHGSQGPHPASEVRPDURQVGLAURNRPLIAELRLGHFILSUGPSRDPVKGPLLAKAUSSRUANGPGPNVRERVRKQGLQSRSSSGNSGRUGPAARPDRKRCAPUYRKGYEERRPPGAUSGUUGPSIYCWSLSKLPRSTARTVGLKLVUWRLAVPLNUACUCAUUAKRYPRRGDCSCLFSLHRFGLSYFYGPQRGUFARGQGWDGEWPSTARTDRRKEGSPKELRVRHPSQGPHPASEVRLLGGQVGLAURNRPLIAELRLCIGPNRUHPRRDSTARTEQUEGRHVPRPGGAGLUFGRRGAIGPVLNLGLRUPLDUYAFGWEPVAQAVEVVGVRRVEPVAUQGA
translate() function¶Here is what the function translate() does:
translate() function¶Here is what the function translate() does:
It takes a dictionary (dict) and a string (seq) as arguments, which correspond to translation_aminoacids and mRNA respectively.
re.compile() creates a regular expression object from a pattern. We can then use many of the re functions as methods on the object, without including a pattern as an argument.
re.escape is mapped over dict.keys() in order to escape any special regex characters in our initial dictionary (in this case there are none). The map object is then passed to .join(), the final result being a |-separated pattern of codons, which is assigned to regex.
regex calls the .sub() method, where another lambda function queries our translation dictionary as the repl parameter; sub() supplies a re.Match object to the x variable.
Overall, the code does not seem as obscure as it looked the first time you saw it, right? Admittedly, it is still a bit of a Rube Goldberg machine, code-wise. Here's a cleaner version of that code:
# Define our function translate()
def translate(dict, seq):
    # Build a regular expression object
    regex = re.compile('|'.join(dict.keys()))
    # return
    return regex.sub(lambda x: dict[x.group()], # Our replacement string
                     seq) # Our search string
# Run our function
print(translate(translation_aminoacids, mRNA))
RNDRKKVSEAGGLLVVFLAAPPLWAVLIFLWSAKGDLRGSURLGRRSTARTGPSTARTDRRKEGSPSHRRVRHGSQGPHPASEVRPDURQVGLAURNRPLIAELRLGHFILSUGPSRDPVKGPLLAKAUSSRUANGPGPNVRERVRKQGLQSRSSSGNSGRUGPAARPDRKRCAPUYRKGYEERRPPGAUSGUUGPSIYCWSLSKLPRSTARTVGLKLVUWRLAVPLNUACUCAUUAKRYPRRGDCSCLFSLHRFGLSYFYGPQRGUFARGQGWDGEWPSTARTDRRKEGSPKELRVRHPSQGPHPASEVRLLGGQVGLAURNRPLIAELRLCIGPNRUHPRRDSTARTEQUEGRHVPRPGGAGLUFGRRGAIGPVLNLGLRUPLDUYAFGWEPVAQAVEVVGVRRVEPVAUQGA
That's a much cleaner function now. As a final touch, let's add a docstring so that anyone can query what the function does:
# Define our function translate()
def translate(dict, seq):
    """The translate function converts an mRNA sequence to an amino acid sequence
    Parameters
    ----------
    dict: ...
    seq: ...
    Returns
    -------
    out: ...
    """
    # Build a regular expression object
    regex = re.compile('|'.join(dict.keys()))
    # return
    return regex.sub(lambda x: ..., # Our replacement string
                     seq) # Our search string
# Print just the docstring information
print(translate.__doc__)
# Or this
# help(translate)
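One subtlety worth flagging: the regex approach replaces codons wherever they match, without enforcing a reading frame. A frame-aware sketch (the helper name translate_in_frame and the tiny codon table below are ours, for illustration only) steps through the sequence three bases at a time instead:

```python
def translate_in_frame(codon_table, seq):
    # Walk the sequence in steps of 3; unknown codons become 'X'
    return ''.join(codon_table.get(seq[i:i + 3], 'X')
                   for i in range(0, len(seq) - 2, 3))

# Tiny toy table for demonstration only
toy_table = {'AUG': 'M', 'UUU': 'F', 'UAA': ''}
print(translate_in_frame(toy_table, 'AUGUUUUAA'))  # MF
```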
As part of your new job, you download a .csv file from a database twice a day. These files are compressed, so you need to extract them before use. Historically, staff have downloaded these files manually, but you identify this as something that can easily be automated, so you decide to write a function to do the job for you and your team.
First, let's write the code for each task:
Our target file is hosted at https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz
We'll begin by creating a download folder using the os package. We can use os.makedirs to create all non-existent folders within the path we want to make.
# Import the os package
import os
# Define the path to download the data
HOUSING_PATH = os.path.join("datasets", "housing") # path to store data in
# Create a directory to store the data
os.makedirs(HOUSING_PATH, exist_ok=True)
# exist_ok=True suppresses the error if the folder already exists; note that nothing is overwritten
We'll now create a string path to the website where we want to download from. We'll use the urlretrieve() function from the urllib package to help pull down the data.
# Where is our content located
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/" # Web page containing the dataset
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz" # this is the full web path
# Download the data
import urllib.request
# full path to save housing.tgz
tgz_path = os.path.join(HOUSING_PATH, "housing.tgz")
# Retrieve the dataset using the URL we defined
urllib.request.urlretrieve(HOUSING_URL, tgz_path)
('datasets\\housing\\housing.tgz', <http.client.HTTPMessage at 0x1fb95969580>)
Now that we've downloaded the file to our new directory, we can go ahead and extract it from the .tgz container. We'll import another package, tarfile, to help us open and extract the file. To do so, we'll open() a connection to the file and then extractall() its contents.
#Extract file
import tarfile
tgz_path = os.path.join(HOUSING_PATH, 'housing.tgz') # the file we are interested in
# Open a connection to the path
housing_tgz = tarfile.open(tgz_path)
# Extract all the contents
housing_tgz.extractall(path=HOUSING_PATH)
# Close the connection
housing_tgz.close()
Now that the extracted data is sitting there, we can load it using the functions from pandas so we can work with it as a DataFrame.
# load the data
import pandas as pd
# Make a path to the data
csv_path = os.path.join(HOUSING_PATH, "housing.csv")
# Read the data in
data = pd.read_csv(csv_path)
# Take a peek at the data
data.head()
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Now you have a script that downloads the data and loads it to the environment. Let's split those four code blocks into two functions: One to fetch the data and another to load the data.
Let's start by fetching the data with a function named fetch_data()
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
DATA_PATH = os.path.join("datasets", "housing")
DATA_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_data(data_url = DATA_URL, data_path = DATA_PATH):
    # Make the download directories
    os.makedirs(data_path, exist_ok=True)
    # Set the path for the target file
    tgz_path = os.path.join(data_path, "housing.tgz")
    # Retrieve the target file
    urllib.request.urlretrieve(data_url, tgz_path)
    # Open a connection to the downloaded target
    housing_tgz = tarfile.open(tgz_path)
    # Extract its contents
    housing_tgz.extractall(path=data_path)
    # Close the connection
    housing_tgz.close()
fetch_data(data_url = DATA_URL, data_path = DATA_PATH)
# fetch_data()
Next, create the function that loads the data
import pandas as pd
def load_data(data_path=DATA_PATH):
    # Create a path to the csv file
    csv_path = os.path.join(data_path, "housing.csv")
    # Read the file
    return pd.read_csv(csv_path)
# Load the data using our new function
housing = load_data(data_path = DATA_PATH)
housing.head()
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
All of our functions are completed now so let's plot some of the data.
# %matplotlib inline works only in a Jupyter notebook
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
array([[<AxesSubplot:title={'center':'longitude'}>,
<AxesSubplot:title={'center':'latitude'}>,
<AxesSubplot:title={'center':'housing_median_age'}>],
[<AxesSubplot:title={'center':'total_rooms'}>,
<AxesSubplot:title={'center':'total_bedrooms'}>,
<AxesSubplot:title={'center':'population'}>],
[<AxesSubplot:title={'center':'households'}>,
<AxesSubplot:title={'center':'median_income'}>,
<AxesSubplot:title={'center':'median_house_value'}>]],
dtype=object)
This housing dataset contains latitude and longitude variables, so let's render it as a map, using population for point size and median_house_value for colour, with the warmest colours marking the highest values.
# Make a classic x/y scatter plot
# Use longitude as x and latitude as y
housing.plot(kind="scatter", # What kind of plot is it?
x="longitude", y="latitude", # Set the source of x/y values
alpha=0.4, # Make our dots a little more transparent
s=housing["population"]/100, label="population", # Scale population by 100 and label
figsize=(10,7), # Set the figure size
c="median_house_value", # Set the colour based on median house value
cmap=plt.get_cmap("jet"), colorbar=True, # Create a colourbar
)
plt.legend()
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
<matplotlib.legend.Legend at 0x1fb99afa2b0>
So, from now on, all you need to do to download and work with your data is define the link to the data warehouse and the directories you want to use, then run fetch_data() and load_data().
Throughout this entire course, we've been focused on just getting the basics of Python. To review, some of the topics we've covered include:
All of these topics help build a foundation for analysing your data, and also provide you with an understanding of the Python language so you can move forward with learning new packages relevant to your own research. For instance, knowing what a list or tuple is, or how to unpack an iterable, will simplify learning new coding paradigms.
In this last section, we'll look beyond Python, for those with experience in, or plans to work with, the R programming language, or with command-line programs for larger sequence-analysis steps.
rpy2¶While Python is very dynamic, with a wide number of packages available, sometimes you might feel more comfortable working with something familiar. For instance, the ggplot2 package is the de facto visualization package in R, and additional packages are built with ggplot2 as a basis, expanding on its grammar-of-graphics style. Of course, there is a relatively recent Python version of this package called plotnine. Comparing the two, there is an active community of over 250 contributors working on ggplot2 to update and improve it on a regular basis, while plotnine is developed and maintained by a single individual.
Taking that into account, you may want to visualize your data with R rather than Python! The same could be said for other standard bioinformatics packages from R whose functionality might be useful, like DESeq2 for differential expression analysis of RNA-Seq data.
Rather than splitting your time between two programming languages, and their development environments, you can run it all from a single script with the rpy2 package. Let's look at the two ways we can use this package to our advantage.
taxa_pitlatrine.csv¶Before we jump into working with rpy2, let's bring in some interesting data to plot for our example. We'll bring back our friend taxa_pitlatrine.csv, which we worked with back in Lecture 04. The following code cell will import and filter the data just like we did last week.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Load the pitlatrine data
latrine_OTU_data = pd.read_csv("data/taxa_pitlatrine.csv")
# Filter the data
latrine_filtered = (
    latrine_OTU_data[latrine_OTU_data["OTUs"] > 0]    # Keep rows with OTU counts
    .assign(OTU_log=lambda df: np.log(df["OTUs"]))    # Log-transform the counts
    .sort_values("Taxa")                              # Sort your data by Taxa
)
# Take a look at the start of the data
latrine_filtered.head()
|      | Taxa              | Country  | Latrine_Number | Depth | OTUs | OTU_log  |
|------|-------------------|----------|----------------|-------|------|----------|
| 3848 | Acidobacteria_Gp1 | Vietnam  | 7              | 2     | 1    | 0.000000 |
| 676  | Acidobacteria_Gp1 | Tanzania | 4              | 5     | 1    | 0.000000 |
| 3900 | Acidobacteria_Gp1 | Vietnam  | 7              | 3     | 7    | 1.945910 |
| 4056 | Acidobacteria_Gp1 | Vietnam  | 9              | 2     | 18   | 2.890372 |
| 3172 | Acidobacteria_Gp1 | Vietnam  | 22             | 1     | 1    | 0.000000 |
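As an aside, the filter → assign → sort pattern can be sketched on a tiny made-up table (the values here are invented for illustration); note that a lambda passed to .assign receives the already-filtered frame:

```python
import numpy as np
import pandas as pd

# A tiny, invented table standing in for the real data
toy = pd.DataFrame({"Taxa": ["B", "A", "A"], "OTUs": [0, 3, 10]})

toy_filtered = (
    toy[toy["OTUs"] > 0]                            # keep rows with OTU counts
    .assign(OTU_log=lambda df: np.log(df["OTUs"]))  # the lambda sees the filtered frame
    .sort_values("Taxa")                            # sort by Taxa
)
print(toy_filtered)
```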
rpy2 to import packages from the R library

To begin with, the rpy2 package provides a function importr that allows you to import R packages for use in Python by assigning them to variables/aliases. For everything to work, we'll actually need to import a number of packages and modules. More specifically, we'll be importing:
- rpy2.robjects: This will give us the ability to directly access base R objects and values like pi.
- rpy2.robjects.packages.importr: This will provide us with the functionality to import other R packages.

# Begin by importing the R objects handler
import rpy2.robjects as ro
# Import the `rpy2` importer
from rpy2.robjects.packages import importr
# Import some basic functions for R
base = importr('base')
utils = importr('utils')
# We can access basic R values like pi
base.pi
# Or check on our current directory
print(base.getwd())
| 3.141593 |
[1] "C:/Users/mokca/Dropbox/!CAGEF/Course_Materials/Introduction_to_Python/2022.01_Intro_Python/Lecture_07_User_Functions"
pandas.DataFrame to R data.frame

While the pandas and base R versions of data frame objects share significant overlap in behaviour and design, they are still distinct object types. In order to use R packages, we'll have to make the conversion from one to the other. This, itself, involves two additional modules. To keep our code concise, we will import only the specific modules we want:
- rpy2.robjects.pandas2ri: This module is necessary for converting our pandas objects, like DataFrames, to R data.frame objects.
- rpy2.robjects.conversion.localconverter: A context manager (see Section 7: Appendix 1) that is required for rpy2 to limit the scope of conversion.

# Import a specific DataFrame converter between pandas and R
from rpy2.robjects import pandas2ri
# Import a context manager that allows the conversion to be completed properly (See Appendix 1!)
from rpy2.robjects.conversion import localconverter
# What kind of object is our filtered data?
type(latrine_filtered)
# This is a context manager which is REQUIRED to properly convert between data frame versions
with localconverter(ro.default_converter + pandas2ri.converter):
r_latrine_filtered = ro.conversion.py2rpy(latrine_filtered)
# What do we have now?
type(r_latrine_filtered)
# What are the properties of our dataframe?
utils.str(r_latrine_filtered)
pandas.core.frame.DataFrame
rpy2.robjects.vectors.DataFrame
'data.frame': 1512 obs. of 6 variables:
 $ Taxa          : chr "Acidobacteria_Gp1" "Acidobacteria_Gp1" "Acidobacteria_Gp1" "Acidobacteria_Gp1" ...
 $ Country       : chr "Vietnam" "Tanzania" "Vietnam" "Vietnam" ...
 $ Latrine_Number: int 7 4 7 9 22 9 4 22 7 7 ...
 $ Depth         : int 2 5 3 2 1 4 2 1 2 3 ...
 $ OTUs          : int 1 1 7 18 1 38 2 3 1 9 ...
 $ OTU_log       : num 0 0 1.95 2.89 0 ...
<rpy2.rinterface_lib.sexp.NULLType object at 0x000001FB9974BCC0> [RTYPES.NILSXP]
Now that we have our data converted to an R-based data.frame we can proceed to plotting it. Here we'll import the packages we'll need to plot the data and display it:
- rpy2.robjects.lib.ggplot2: Rather than import ggplot2 directly, we'll go through the rpy2 package. It's basically a wrapper that calls on the importr() function. We need this to cover some functionality we can't reach regularly through a command like importr('ggplot2').
- rpy2.robjects.lib.grdevices: Again, a way to access the graphics devices that we'll need to print our plot. This has more to do with rendering our plots in the Jupyter Notebook.
- IPython.display.Image and IPython.display.display: To handle the display of our plots in your Jupyter Notebook.

# Import ggplot2 but just let it float in the background. We'll only passively use this.
import rpy2.robjects.lib.ggplot2 as ggplot2rpy2
# Access the graphics display drivers for R
from rpy2.robjects.lib import grdevices
# Access display properties of the Jupyter Notebook
from IPython.display import Image, display
C:\Users\mokca\anaconda3\lib\site-packages\rpy2\robjects\packages.py:366: UserWarning: The symbol 'quartz' is not in this R namespace/package. warnings.warn(
ggplot2

Now that we've set up all of the necessary components, we can go ahead and plot our data. We'll do this within the confines of a context manager, grdevices.render_to_bytesio(). This will ensure that, should something go wrong, the various connections created will be quietly closed without creating additional memory leaks.
In our context manager, we'll also set the width and height of our image, producing an object called img that will hold the resulting plot information.
You'll also notice in our code that any ggplot2 functions normally loaded into memory in R must be called with dot (.) notation in the normal Python style, and that the dots in R argument names become underscores (e.g., axis.text.x becomes axis_text_x).
# Import the ggplot2 library
ggplot2 = importr('ggplot2')
# Build and display a plot, but pass all that information on to img
with grdevices.render_to_bytesio(grdevices.png, width=700, height=700, res=150) as img:
    # Save our plot as an object
    # 1. Data
    gp = (ggplot2.ggplot(r_latrine_filtered) +
          # 2. Aesthetics
          # Set the x and y axis data
          ggplot2.aes_string(x='Taxa', y='OTUs') +
          # Change the x-axis text angle and size
          # ggplot2.theme(text = ggplot2.element_text(family = "Arial", size=7)) +
          ggplot2.theme_minimal(base_family = "Arial") +
          ggplot2.theme(axis_text_x = ggplot2.element_text(angle = 90, size = 7),
                        text = ggplot2.element_text(size = 7)) +
          # 3. Scaling
          ggplot2.scale_y_log10() +  # Set the y-axis to log scale
          # 4. Geom
          ggplot2.geom_boxplot())    # Use a boxplot
    # Plot the ggplot object
    gp.plot()
# Display the plot via the code cell using IPython
display(Image(data=img.getvalue(), format='png', embed=True))
What a wild ride! We used import to bring in or alias eight different modules and components to make our graph, and many of those were used just to display our plot in the Jupyter Notebook. While this will certainly not be the case within a simpler Python script, there are still many nuances involved in making this work.
Under the hood of the Jupyter Notebook, there are a number of IPython-specific magics. These are essentially syntactic shortcuts beginning with % that allow you to perform quick operations or directives involving your code. In our case, what we are really interested in is making our life using R in Python just a little simpler.
%load_ext

The IPython extensions are just what they sound like. Recall that our Jupyter Notebooks are built on an IPython shell which controls how we access Python and output the code we run. We have already seen an example of modifying this behaviour via IPython.core.interactiveshell. Extensions add extra modifications to the behaviour of this shell, and we can add more using the magic command %load_ext <extension_name>.
To gain access to the rpy2 package and its capabilities, we can use the extension rpy2.ipython.
# Load the rpy2.ipython extension
%load_ext rpy2.ipython
%R

Sometimes you may wish to simply slip in some R code between calculations or lines in your Python code. Perhaps you want to access a package and generate a quick calculation in R. Do this with %R to let Python know your intent. Along with this magic comes a series of possible parameters, including:
- -i <input>: An opportunity to pass in a variable and automatically convert it to an equivalent object in R. This applies to dataframes as well!
- -o <output>: Prepare to convert an R object from your upcoming code to an equivalently named Python object.
- -w <width> and -h <height>: The width or height of your plotting device. Especially helpful when generating plots.

# Take a look at the data
latrine_filtered.head()
# calculate the mean of OTUs in R and pass it back
%R ... latrine_filtered ... OTU_mean = mean(latrine_filtered$OTUs)
# Now print it in Python
print(OTU_mean)
%%R

When you are generating or running large portions of code, you can convert the whole code cell to work through rpy2 using the %%R magic. It works with the same parameters as %R to simplify the process.
Let's use this to create a code cell where we plot our data as above. Since we are plotting, we'll set the width and height of our plotting device at the same time.
You'll see, as well, just how simple the process of creating our plot is when using the R magics. No additional Pythonic coding with packages; just direct coding as you would (for the most part) in an R setting. The biggest difference here is that we will also provide our pandas.DataFrame object, latrine_filtered, as input. This will automatically be converted by rpy2 for us.
%%R -i latrine_filtered -w 800 -h 600
library(ggplot2)
pl = ggplot(...) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
theme_grey(base_family = "Arial") +
theme(axis.text.x = element_text(angle = 90)) +
# 3. Scaling
scale_y_log10()+
# 4. Geoms
geom_boxplot()
pl
Yes, our journey is over - for this class. Unfortunately we left a lot out of this course that you might find useful! We barely covered topics like:
And we didn't even touch on things like:
That's our seventh and final class on Python! We worked through a lot of important steps for building and using our own functions. To summarize, we've covered:
- rpy2 to combine Python and R.

Best of luck with your data science journey!
At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade, but a bonus 0.7% will also be awarded for submissions made within 24 hours of the end of lecture (i.e. by 17:00 the following day).
Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 (Best Practices, 750 possible points; Context Managers, 800 possible points; and Decorators, 1050 possible points) from the Writing Functions in Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 1950 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this where it shows the total points earned in each chapter:
A sample screenshot for one of the DataCamp assignments. You'll want to combine yours into single images or PDFs if possible.
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, March 3rd to submit your assignment (right before the next lecture).
Your final project will be due two weeks after this lecture at 13:59 hours on Thursday March 10th. Please submit your final assignment as a single compressed archive which will include:
Please refer to the marking rubric found in your lecture notes for additional instructions and FAQs.
You can build your Jupyter Notebooks on the UofT JupyterHub and save/download a compressed archive to your personal computer before submitting on Quercus.
Any additional questions can be emailed to me or posted to the Discussion section of Quercus. Best of luck!
Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
In the context of scope and memory, context managers allow you to create variables and manipulate objects in a way that ensures their memory footprint doesn't pollute your global environment. Problems can arise when Python opens a connection or generates a resource but you forget to close it, or when an exception or error is encountered during that code block. In those cases, you may inadvertently "lose" the connection, but it is not destroyed by Python. This lingering connection produces a "memory leak", as it cannot be properly removed from the program's memory. It may also prevent you from re-opening the same connection or altering a file later outside of Python.
There are two main ways of dealing with this: the try/finally exception structure or a context manager. We won't go over all of the details, but in both cases, when you wish to use a resource you usually:

1. Open or acquire the resource (e.g. a file connection)
2. Use the resource
3. Close or release the resource

When your code runs, an error at step 2 may skip step 3 altogether. The try/finally statements encapsulate your code to ensure that if an error occurs while attempting step 2, the file will still be closed when you are finally done.
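Here is a minimal sketch of the try/finally pattern (using a temporary file so the example runs anywhere; the file name is invented):

```python
import os
import tempfile

# A throwaway file path so the sketch runs anywhere
path = os.path.join(tempfile.gettempdir(), "ctx_demo.txt")

f = open(path, "w")           # step 1: open the resource
try:
    f.write("some data")      # step 2: use it (an error here...)
finally:
    f.close()                 # step 3: ...would still trigger this close

print(f.closed)               # True
```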
Similarly, a context manager is a pre-formed version of this, where common Python actions are coded to catch errors and "recover gracefully". Context managers avoid having to produce additional complicated code and take the form:
with <context-manager>(<args>) as target:
# your code involving target
# code is running inside this context
# This code runs after the context is removed. Notice that this line is not indented
The keyword with signals to Python the use of a context manager. If the context manager produces an object, like a file connection, the optional keyword as assigns that object to the variable target.
Here is an example of the context manager open to import a file and perform some operations on it.
# Use a context manager to open a file
with open('data/sequences.tsv') as file:
# Read the file in from the connection
file_text = file.read()
file_text_length = len(file_text)
print("The sequences file has {} characters".format(file_text_length))
From the above code, under the hood, when we leave the indentation group and print our information, the context manager closes the file connection. If, for some reason, we called on an incorrect method, we might be returned a warning or exception that could end the program. Before the kernel exits the program, however, the context manager would execute its __exit__ method, which includes shutting down the file connection. Therefore the file is not occupying space in memory anymore!
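Under the hood, any object with __enter__ and __exit__ methods can act as a context manager. This toy class (the name Demo is ours, for illustration) shows that __exit__ runs even when an exception is raised inside the with-block:

```python
# A toy context manager class (the name Demo is ours, for illustration)
class Demo:
    def __enter__(self):
        # Runs when the with-block is entered; the return value is bound by "as"
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs however the block ends, even when an exception is raised
        print("__exit__ ran")
        return False  # False means any exception is re-raised

try:
    with Demo():
        raise ValueError("oops")
except ValueError:
    pass  # __exit__ ran before the exception reached us
```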
Context managers are pre-made functions that have certain assumptions about the order of how actions should be performed. While some are relatively simple like the open command, you may find yourself in a situation where you have specific needs.
You can use context managers to manage paired action patterns that you use often, like:

- Open a file / close the file
- Connect to a database / disconnect from the database
- Acquire a lock / release the lock
You may be building a function where you expect a certain data format or style. To build your own context manager, you define it like you would define a function. Inside that function, use the keyword yield, optionally followed by any code that would help clean up the context.
Cleanup code can be, for example, code that disconnects from a remote database after you are finished using it. An easy way to build context managers is with the @contextlib.contextmanager decorator placed right above the context manager's definition.
Let's take a look at the format.
# Import the contextlib package
import contextlib

# Use the @ syntax to signify the wrapped function is a context manager
@contextlib.contextmanager
def my_context():
    # here goes your setup code
    yield
    # add any code that you need to clean up the context
The yield statement is where the cleanup happens

The yield statement tells Python that you want to return a value but that you will finish the rest of the function afterwards. More importantly, this statement saves the state of the local variables. Remember what we talked about with scoping? When a function yields, its scope and state are saved. Returning to the function picks back up right where the yield was called. For a context manager, we include this call only once: the contextmanager wrapper is coded specifically to look for this structure to initiate your function's cleanup.
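This same state-saving behaviour is what powers generators; each call to next() resumes right after the previous yield (a toy example):

```python
def counter():
    n = 1
    yield n      # pause here, remembering the local variable n
    n += 1
    yield n      # resume and pause again with the updated value

gen = counter()
print(next(gen))  # 1
print(next(gen))  # 2 -- n was preserved between calls
```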
Here is a more useful example of how to use a context manager. Let's say that you have set your working directory, but sometimes you need to access files outside your current working directory. Let's write a context manager that changes the current working directory, performs a task, and changes the path back to your previously set working directory.
# Import the os package
import os
os.getcwd() # current working directory
#import contextlib
# Decorate our function
@contextlib.contextmanager
# Define our function in_dir() which takes a parameter: path
def in_dir(path):
    old_dir = os.getcwd()  # The current working directory
    os.chdir(path)         # The alternative directory we want to use
    yield                  # When the contextmanager object uses __exit__, it will come here
    # and change the current directory back to the original!
    os.chdir(old_dir)
Let's see the context manager in action!
# Use our context manager now using the "with" keyword
with in_dir('../Lecture_06_Regex/data/') as lec_5:
    # What is lec_5?
    type(lec_5)
    # Take the path and list what is in that directory using the os.listdir command
    data_files = os.listdir()
    # Print the results of that directory
    print("Current directory:", os.getcwd(), end="\n\n")
    print("Files:", data_files, end="\n\n")

# Back outside the context manager: what was our original directory?
print("Original directory:", os.getcwd())
The None object

From our example above, you can see that our in_dir() context manager yields the None object when we assign it to the variable lec_5. This is apparent in the context manager code itself: it takes in a path and uses the os package to change the directory, but nothing is yielded, so nothing is assigned to lec_5. We could pass lec_5 to our call as os.listdir(lec_5) instead of os.listdir(), but it isn't necessary.
More importantly, the context manager works and your current working directory is intact!
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.